library(car)
Loading required package: carData
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ purrr::some() masks car::some()
library(modelr)
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
build our model: best predictor first
mod1a <- lm(prestige~ education, data = prestige_trim)
summary(mod1a)
Call:
lm(formula = prestige ~ education, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-21.605 -6.151 0.366 6.565 17.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.8409 3.5285 -3.072 0.00276 **
education 5.3884 0.3168 17.006 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.578 on 96 degrees of freedom
Multiple R-squared: 0.7508, Adjusted R-squared: 0.7482
F-statistic: 289.2 on 1 and 96 DF, p-value: < 2.2e-16
library(ggfortify)
autoplot(mod1a)
mod1b <- lm(prestige ~ type, data = prestige_trim)
summary(mod1b)
Call:
lm(formula = prestige ~ type, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-18.2273 -7.1773 -0.0854 6.1174 25.2565
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.527 1.432 24.810 < 2e-16 ***
typeprof 32.321 2.227 14.511 < 2e-16 ***
typewc 6.716 2.444 2.748 0.00718 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.499 on 95 degrees of freedom
Multiple R-squared: 0.6976, Adjusted R-squared: 0.6913
F-statistic: 109.6 on 2 and 95 DF, p-value: < 2.2e-16
autoplot(mod1b)
prestige_resid <- prestige_trim %>%
add_residuals(mod1a) %>%
select(-c(prestige, education))
prestige_resid %>%
select(resid, everything()) %>%
ggpairs(aes(colour = type, alpha = 0.5))
add second predictor: next best explains the most of residual error
mod2a <- lm(prestige ~ education + ln_income,
data = prestige_trim)
summary(mod2a)
Call:
lm(formula = prestige ~ education + ln_income, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-16.6775 -4.0506 -0.2538 4.0648 16.7992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -101.1882 12.8490 -7.875 5.51e-12 ***
education 4.0375 0.3173 12.726 < 2e-16 ***
ln_income 12.0564 1.6719 7.211 1.33e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.932 on 95 degrees of freedom
Multiple R-squared: 0.8389, Adjusted R-squared: 0.8356
F-statistic: 247.4 on 2 and 95 DF, p-value: < 2.2e-16
autoplot(mod2a)
mod2b <- lm(prestige ~ education + income,
data = prestige_trim)
summary(mod2b)
Call:
lm(formula = prestige ~ education + income, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-16.9367 -4.8881 0.0116 4.9690 15.9280
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.6210352 3.1162309 -2.446 0.0163 *
education 4.2921076 0.3360645 12.772 < 2e-16 ***
income 0.0012415 0.0002185 5.682 1.45e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.45 on 95 degrees of freedom
Multiple R-squared: 0.814, Adjusted R-squared: 0.8101
F-statistic: 207.9 on 2 and 95 DF, p-value: < 2.2e-16
autoplot(mod2b)
mod2c <- lm(prestige ~ education + type,
data = prestige_trim)
summary(mod2c)
Call:
lm(formula = prestige ~ education + type, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-19.410 -5.508 1.360 5.694 17.171
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.6982 5.7361 -0.470 0.6392
education 4.5728 0.6716 6.809 9.16e-10 ***
typeprof 6.1424 4.2590 1.442 0.1526
typewc -5.4585 2.6907 -2.029 0.0453 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.814 on 94 degrees of freedom
Multiple R-squared: 0.7975, Adjusted R-squared: 0.791
F-statistic: 123.4 on 3 and 94 DF, p-value: < 2.2e-16
mod2a gave the biggeres uplift in r2, residuals look great, add it in check significance with anova
anova(mod1a, mod2c)
Analysis of Variance Table
Model 1: prestige ~ education
Model 2: prestige ~ education + type
Res.Df RSS Df Sum of Sq F Pr(>F)
1 96 7064.4
2 94 5740.0 2 1324.4 10.844 5.787e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
it is statistically significant (the 3 levels have different means, so we can include type in the model with education)
3rd predictor
prestige_resid <- prestige_trim %>%
add_residuals(mod2a) %>%
select(-c(prestige, education, ln_income))
prestige_resid %>%
select(resid, everything()) %>%
ggpairs(aes(colour = type, alpha = 0.5))
mod3a <- lm(prestige ~ education + ln_income + women,
data = prestige_trim)
summary(mod3a)
Call:
lm(formula = prestige ~ education + ln_income + women, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-16.8202 -4.7019 0.0696 4.2245 17.6833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -129.16790 18.95716 -6.814 8.97e-10 ***
education 3.59404 0.38431 9.352 4.39e-15 ***
ln_income 15.60547 2.43246 6.416 5.62e-09 ***
women 0.06481 0.03270 1.982 0.0504 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.828 on 94 degrees of freedom
Multiple R-squared: 0.8454, Adjusted R-squared: 0.8405
F-statistic: 171.4 on 3 and 94 DF, p-value: < 2.2e-16
mod3b <- lm(prestige ~ education + ln_income + type,
data = prestige_trim)
summary(mod3b)
Call:
lm(formula = prestige ~ education + ln_income + type, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-13.511 -3.746 1.011 4.356 18.438
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.2019 13.7431 -5.909 5.63e-08 ***
education 3.2845 0.6081 5.401 5.06e-07 ***
ln_income 10.4875 1.7167 6.109 2.31e-08 ***
typeprof 6.7509 3.6185 1.866 0.0652 .
typewc -1.4394 2.3780 -0.605 0.5465
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.637 on 93 degrees of freedom
Multiple R-squared: 0.8555, Adjusted R-squared: 0.8493
F-statistic: 137.6 on 4 and 93 DF, p-value: < 2.2e-16
anova(mod2a, mod3b)
Analysis of Variance Table
Model 1: prestige ~ education + ln_income
Model 2: prestige ~ education + ln_income + type
Res.Df RSS Df Sum of Sq F Pr(>F)
1 95 4565.4
2 93 4096.3 2 469.07 5.3247 0.006465 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
diagnostic plot fine, anova fine, let’s add type
interactions between existing main effects
prestige_resid <- prestige_trim %>%
add_residuals(mod3b) %>%
select(-prestige)
mod4a <- lm(prestige ~ education + ln_income + type +
education : ln_income,
data = prestige_trim)
summary(mod4a)
Call:
lm(formula = prestige ~ education + ln_income + type + education:ln_income,
data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-13.6129 -4.1461 0.7695 4.1546 17.9453
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -166.7630 51.0506 -3.267 0.001530 **
education 11.1444 4.5601 2.444 0.016438 *
ln_income 20.2741 5.8790 3.449 0.000852 ***
typeprof 6.8626 3.5803 1.917 0.058374 .
typewc -2.3516 2.4103 -0.976 0.331796
education:ln_income -0.8892 0.5114 -1.739 0.085414 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.566 on 92 degrees of freedom
Multiple R-squared: 0.8601, Adjusted R-squared: 0.8525
F-statistic: 113.1 on 5 and 92 DF, p-value: < 2.2e-16
mod4b <- lm(prestige ~ education + ln_income + type +
education : type,
data = prestige_trim)
summary(mod4b)
Call:
lm(formula = prestige ~ education + ln_income + type + education:type,
data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-13.962 -3.797 1.041 4.092 16.503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.6672 14.3681 -5.684 1.57e-07 ***
education 3.1407 0.9004 3.488 0.000751 ***
ln_income 10.6833 1.7108 6.245 1.33e-08 ***
typeprof 15.6176 14.2168 1.099 0.274871
typewc -30.4466 18.3465 -1.660 0.100451
education:typeprof -0.5801 1.2211 -0.475 0.635887
education:typewc 2.6675 1.7551 1.520 0.132018
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.585 on 91 degrees of freedom
Multiple R-squared: 0.8608, Adjusted R-squared: 0.8516
F-statistic: 93.79 on 6 and 91 DF, p-value: < 2.2e-16
mod4c <- lm(prestige ~ education + ln_income + type +
ln_income : type,
data = prestige_trim)
summary(mod4c)
Call:
lm(formula = prestige ~ education + ln_income + type + ln_income:type,
data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-13.484 -4.453 1.122 4.123 18.737
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -118.4325 20.3728 -5.813 8.97e-08 ***
education 3.2107 0.5993 5.357 6.31e-07 ***
ln_income 14.9336 2.4928 5.991 4.12e-08 ***
typeprof 82.7757 31.5059 2.627 0.0101 *
typewc 51.3717 36.8521 1.394 0.1667
ln_income:typeprof -8.5690 3.5251 -2.431 0.0170 *
ln_income:typewc -6.1925 4.3172 -1.434 0.1549
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.491 on 91 degrees of freedom
Multiple R-squared: 0.8647, Adjusted R-squared: 0.8558
F-statistic: 96.96 on 6 and 91 DF, p-value: < 2.2e-16
anova(mod3b, mod4c)
Analysis of Variance Table
Model 1: prestige ~ education + ln_income + type
Model 2: prestige ~ education + ln_income + type + ln_income:type
Res.Df RSS Df Sum of Sq F Pr(>F)
1 93 4096.3
2 91 3834.2 2 262.13 3.1107 0.04934 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
calc.relimp(mod4b, type = "lmg", rela = TRUE)
Response variable: prestige
Total response variance: 292.2358
Analysis based on 98 observations
6 Regressors:
Some regressors combined in groups:
Group type : typeprof typewc
Group education:type : education:typeprof education:typewc
Relative importance of 4 (groups of) regressors assessed:
type education:type education ln_income
Proportion of variance explained by model: 86.08%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
type 0.357978988
education:type 0.005688181
education 0.409114479
ln_income 0.227218351
Average coefficients for different model sizes:
1group 2groups 3groups 4groups
education 5.388408 4.305159 4.024068 3.1406913
ln_income 24.619472 12.879804 10.487471 10.6833411
typeprof 32.321114 14.810837 12.807271 15.6176298
typewc 6.716206 1.013706 -12.911341 -30.4465888
education:typeprof NaN NaN -0.980805 -0.5800772
education:typewc NaN NaN 1.670938 2.6675490
library(lm.beta)
mod4c_betas <- lm.beta(mod4c)
summary(mod4c_betas)
Call:
lm(formula = prestige ~ education + ln_income + type + ln_income:type,
data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-13.484 -4.453 1.122 4.123 18.737
Coefficients:
Estimate Standardized Std. Error t value Pr(>|t|)
(Intercept) -118.4325 NA 20.3728 -5.813 8.97e-08 ***
education 3.2107 0.5163 0.5993 5.357 6.31e-07 ***
ln_income 14.9336 0.4557 2.4928 5.991 4.12e-08 ***
typeprof 82.7757 2.2634 31.5059 2.627 0.0101 *
typewc 51.3717 1.2801 36.8521 1.394 0.1667
ln_income:typeprof -8.5690 -2.1495 3.5251 -2.431 0.0170 *
ln_income:typewc -6.1925 -1.3066 4.3172 -1.434 0.1549
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.491 on 91 degrees of freedom
Multiple R-squared: 0.8647, Adjusted R-squared: 0.8558
F-statistic: 96.96 on 6 and 91 DF, p-value: < 2.2e-16